In [1]:
import pandas as pd
import numpy as np

import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

General data exploration¶

  • Load data and covert to appropriate types
In [2]:
# load data
site_data = pd.read_csv('data/SiteData.csv').convert_dtypes()
display(site_data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13221 entries, 0 to 13220
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   TIMESTAMP  13221 non-null  string 
 1    READING   13221 non-null  string 
 2    VALUE     13221 non-null  Float64
dtypes: Float64(1), string(2)
memory usage: 322.9 KB
None
  • change timestap to datetime
In [3]:
site_data['TIMESTAMP'] = pd.to_datetime(site_data['TIMESTAMP'])
site_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13221 entries, 0 to 13220
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   TIMESTAMP  13221 non-null  datetime64[ns]
 1    READING   13221 non-null  string        
 2    VALUE     13221 non-null  Float64       
dtypes: Float64(1), datetime64[ns](1), string(1)
memory usage: 322.9 KB
In [4]:
site_data.head()
Out[4]:
TIMESTAMP READING VALUE
0 2020-08-24 00:00:00 PX_GENERATOR_POWER 0.0
1 2020-08-24 00:00:00 PX_RECTIFIER_LOAD 2.3867
2 2020-08-24 00:00:00 PX_GENERATOR_1_FUEL_LEVEL 400.43
3 2020-08-24 00:30:00 PX_GENERATOR_POWER 0.0
4 2020-08-24 00:30:00 PX_RECTIFIER_LOAD 2.4072
  • there are empty spaces in column names, fix that
In [5]:
site_data.columns = site_data.columns.str.strip()
In [6]:
site_data['TIMESTAMP'].dt.month.unique()
Out[6]:
array([ 8,  9, 10, 11], dtype=int32)
In [7]:
site_data['READING'].unique()
Out[7]:
<StringArray>
['PX_GENERATOR_POWER', 'PX_RECTIFIER_LOAD', 'PX_GENERATOR_1_FUEL_LEVEL']
Length: 3, dtype: string

Distributions¶

The distributions generated are all interactive. You can hover, drage to zoom in, zoom out, reset axes and turn on/off individual categories by clicking on the legend.¶

In [8]:
# slice the data frame & create a copy for augmentation 
generator = site_data[site_data['READING'] == 'PX_GENERATOR_POWER']
generator_power = generator.copy()
rectifier= site_data[site_data['READING'] == 'PX_RECTIFIER_LOAD']
rectifier_load = rectifier.copy()
fuel = site_data[site_data['READING'] == 'PX_GENERATOR_1_FUEL_LEVEL']
fuel_level = fuel.copy()
In [9]:
# create box plots for each reading
fig = px.box(y=site_data['VALUE'], x = site_data['READING'], color=site_data['READING'], log_y=True, title='Statistics of each reading', labels={'y': 'value', 'x': 'reading'})
fig.update_layout()
fig.update_traces(boxmean=True)
fig.show()

1. What is the average, min, and max rectifier load?¶

2. What is the average daily variation in rectifier load?¶

In [10]:
# Calculate average, min, and max
average_load = rectifier_load['VALUE'].mean()
min_load = rectifier_load['VALUE'].min()
max_load = rectifier_load['VALUE'].max()

print(f'average load: {average_load} kW, min load: {min_load} kW, max load: {max_load} kW')

# Calculate daily variation
rectifier_load['DATE'] = rectifier_load['TIMESTAMP'].dt.date
daily_variation = rectifier_load.groupby('DATE')['VALUE'].std()
average load: 1.8169906058543226 kW, min load: 0.4999 kW, max load: 2.9144 kW
In [11]:
# 1. Distribution of Rectifier Load with Average, Min, and Max
fig = px.histogram(rectifier_load, x='VALUE', nbins=50, title='Distribution of Rectifier Load with Average, Min, and Max')
fig.add_vline(x=average_load, line_dash="dash", line_color="red", annotation_text=f'Average: {average_load:.2f} kW')
fig.add_vline(x=min_load, line_dash="dash", line_color="green", annotation_text=f'Min: {min_load:.2f} kW')
fig.add_vline(x=max_load, line_dash="dash", line_color="blue", annotation_text=f'Max: {max_load:.2f} kW')
fig.update_layout(xaxis_title='Rectifier Load (kW)', yaxis_title='Frequency')
fig.show()
In [12]:
avg_daily_variation = daily_variation.mean()
print(f'average daily variation is {avg_daily_variation} kW')
average daily variation is 0.08713957714464081 kW
In [13]:
# 2. Average Daily Variation in Rectifier Load
daily_variation_data = rectifier_load.groupby('DATE')['VALUE'].std().reset_index()
fig = px.bar(daily_variation_data, x='DATE', y='VALUE', title='Daily Variation in Rectifier Load')
fig.add_hline(y=avg_daily_variation, line_dash="dash", line_color="red", annotation_text=f'Average Daily Variation: {avg_daily_variation:.2f} kW')
fig.update_layout(xaxis_title='Date', yaxis_title='Standard Deviation of Rectifier Load (kW)')
fig.show()
In [14]:
# plot the distribution of the daily deviation
fig = px.histogram(x=daily_variation, marginal='box', labels={'x':'standard deviation (kW)'}, title='Distribution of the daily vairation in rectifier load')
fig.update_layout()
fig.update_traces()
fig.show()

Note: although the mean variation is 0.08 kW, it is also good to have an idea of outliers which can help identify incidents with large load variation.

3. You will notice that the rectifier load has several step changes during the 3-month period. Does the generator usage on site change? How?¶

  • Lets start by plotting the time series distribution for each reading with a common time stamp
  • In additiona will also check the correlation of rectifier load and generator power
In [15]:
# Create subplots
fig = make_subplots(rows=2, cols=1, shared_xaxes=True, subplot_titles=['PX_GENERATOR_POWER', 'PX_RECTIFIER_LOAD'])


fig.add_trace(go.Scatter(x=generator_power['TIMESTAMP'], y=generator_power['VALUE'], mode='lines+markers', name='PX_GENERATOR_POWER'), row=1, col=1)
fig.add_trace(go.Scatter(x=rectifier_load['TIMESTAMP'], y=rectifier_load['VALUE'], mode='lines+markers', name='PX_RECTIFIER_LOAD'), row=2, col=1)
# fig.add_trace(go.Scatter(x=fuel_level['TIMESTAMP'], y=fuel_level['VALUE'], mode='lines+markers', name='PX_GENERATOR_1_FUEL_LEVEL'), row=3, col=1)

# Update layout
fig.update_layout(
    title='Time Series for Each READING',
    xaxis_title='Timestamp',
    yaxis_title='Value',
    height=600,
    xaxis_rangeslider_visible=False
)

fig.show()
  • time-series plot: comparing the time series plots of rectifier load and generator power, it does look like there is a positive correlation betweed load and power, which make sense.
  • Lets investivage some scatter plots. Using the previous day to day rectifier load deviation plot, lets define a thershold of 0.02 as a step change in the rectifier load
In [16]:
# Calculate step changes
rectifier_load['STEP_CHANGE'] = rectifier_load['VALUE'].diff().abs()

# Define a threshold for significant step changes
threshold = 0.02

# Identify significant step changes in rectifier load
significant_rect_changes = rectifier_load[rectifier_load['STEP_CHANGE'] > threshold]
generator_power['VALUE'] = generator_power['VALUE'].astype(float)


# Merge with generator power to find corresponding values
merged_data = pd.merge_asof(significant_rect_changes.sort_values('TIMESTAMP'),
                            generator_power.sort_values('TIMESTAMP'),
                            on='TIMESTAMP', suffixes=('_rectifier', '_generator'))
# print(merged_data.columns)


# Plot the distributions where rectifier load change is above 0.02 kW 
fig = go.Figure()
# Add rectifier load step changes
fig.add_trace(go.Scatter(x=merged_data['TIMESTAMP'], y=merged_data['VALUE_rectifier'], mode='markers', name='Significant Rectifier Changes', marker=dict(color='red'), yaxis='y2'))
# Add corresponding generator power
fig.add_trace(go.Scatter(x=merged_data['TIMESTAMP'], y=merged_data['VALUE_generator'], mode='markers', name='Corresponding Generator Power', marker=dict(color='blue')))
# Update layout for dual y-axis
fig.update_layout(
  title='Significant Rectifier Load Step Changes and Corresponding Generator Power', 
  xaxis_title='Timestamp',
    yaxis=dict(
        title='Generator Power',
        titlefont=dict(color='blue'),
        tickfont=dict(color='blue')
    ),
    yaxis2=dict(
        title='Rectifier Load',
        titlefont=dict(color='red'),
        tickfont=dict(color='red'),
        anchor='free',
        overlaying='y',
        side='right',
        position=1
    ),
    legend=dict(x=0, y=1.1, orientation='h')
)

fig.show()

NOTE: there are a lot of cases where power is 0 even though load is increasing. These cases can be considered as when the generator is not in use. However even when the generator is in use, the power consumption seems to fluctuate a lot. Is there a load balancer(?) that only draws power from generator when needed? So we are not seeing a pefect correclation.

Lets calculate a correlation coefficient ignoring the cases where power is 0

In [17]:
# correlation ignoring 0 power in generator 
merged_data_ON = merged_data[merged_data['VALUE_generator'] != 0]

# Calculate correlation between rectifier step changes and generator power
correlation_ON = merged_data_ON[['STEP_CHANGE', 'VALUE_generator']].corr().iloc[0, 1]

fig = px.scatter(merged_data_ON, x='STEP_CHANGE', y='VALUE_generator', 
                 title=f'Correlation between Rectifier Load Step Changes and Generator Power (ignoring OFF state) \nCorrelation Coefficient: {correlation_ON:.2f}',
                 labels={'STEP_CHANGE': 'Rectifier Load Step Change (kW)', 'VALUE_generator': 'Generator Power (kW)'},
                 trendline="ols")
fig.show()

We see a slightly positive correlation, which will make sense to me.

4. How many refuellings of the generator are there, and how big are they?¶

5. If the client wants to refuel when the generator drops to 50L of fuel, when do you predict they will next need to refuel?¶

In [18]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=fuel_level['TIMESTAMP'], y=fuel_level['VALUE'], mode='lines+markers'))
fig.show()

Identify refueling events - detect increases in fuel level¶

  • Assumpition: the sensors can detect any changes more than 500 ml reliably
  • Use a threshold of 0.5
In [19]:
fuel_level['FUEL_DIFF'] = fuel_level['VALUE'].diff()
refueling_events = fuel_level[fuel_level['FUEL_DIFF'] > 0.5] # any positive change is refueling 
  • Lets plot the refueling events
In [20]:
fig = px.line(fuel_level, x='TIMESTAMP', y='VALUE', title='Fuel Level with Refueling Events', line_shape='linear')
fig.add_trace(go.Scatter(x=refueling_events['TIMESTAMP'], y=refueling_events['VALUE'], mode='markers', name='Refueling Events', marker=dict(color='red')))
fig.show()
In [21]:
# Calculate the size of each refueling event
refueling_sizes = refueling_events[['TIMESTAMP', 'FUEL_DIFF']].reset_index(drop=True)
print(f'Total number of refueling = {refueling_sizes.shape[0]}') 
display(refueling_sizes)
Total number of refueling = 35
TIMESTAMP FUEL_DIFF
0 2020-08-25 12:00:00 4.8
1 2020-08-25 13:30:00 2.73
2 2020-08-26 14:00:00 5.48
3 2020-08-29 11:00:00 3.42
4 2020-08-29 18:30:00 2.74
5 2020-08-31 20:30:00 5.48
6 2020-09-06 12:30:00 5.48
7 2020-09-08 09:30:00 0.69
8 2020-09-08 12:30:00 3.42
9 2020-09-11 11:30:00 479.04
10 2020-09-12 16:30:00 5.48
11 2020-09-15 14:00:00 6.16
12 2020-09-16 09:00:00 2.74
13 2020-09-16 12:30:00 2.74
14 2020-09-17 09:30:00 2.74
15 2020-09-19 15:30:00 5.48
16 2020-09-21 10:30:00 2.73
17 2020-09-23 15:00:00 105.39
18 2020-09-24 13:30:00 5.48
19 2020-10-06 08:30:00 2.74
20 2020-10-08 10:30:00 2.73
21 2020-10-08 11:30:00 2.74
22 2020-10-11 12:30:00 4.11
23 2020-10-12 11:30:00 4.11
24 2020-10-23 17:00:00 257.31
25 2020-10-25 16:00:00 8.9
26 2020-11-01 15:00:00 2.73
27 2020-11-07 17:30:00 166.98
28 2020-11-13 12:30:00 4.79
29 2020-11-17 12:30:00 5.47
30 2020-11-18 13:00:00 2.06
31 2020-11-19 15:00:00 5.47
32 2020-11-22 14:00:00 2.74
33 2020-11-23 12:00:00 2.73
34 2020-11-23 13:00:00 2.74
In [22]:
# Predict next refuel time (when fuel drops to 50L)
current_fuel_level = fuel_level['VALUE'].iloc[-1]
# print(current_fuel_level)
fuel_depletion_rate = fuel_level['VALUE'].diff().mean()  # average rate of fuel consumption per 30 minutes
# print(fuel_depletion_rate)
time_to_next_refuel = (current_fuel_level - 50) / fuel_depletion_rate #mins
# print(time_to_next_refuel)
next_refuel_time = fuel_level['TIMESTAMP'].iloc[-1] + pd.Timedelta(minutes=time_to_next_refuel) # current time + delta time 

print(f'Next refuel time = {next_refuel_time}')
Next refuel time = 2021-01-21 23:18:13.896713612

6. The generator and rectifier both display step changes over the time-period. How might you automatically detect and quantify these changes?¶

Similarly to 3, we have to define a threshold. For rectifier load we used the minimum of the daily variation as a threshold. We can do that same for generator_power

In [23]:
generator_power['DATE'] = generator_power['TIMESTAMP'].dt.date
power_daily_variation = generator_power.groupby('DATE')['VALUE'].std()
In [24]:
fig = px.histogram(x=power_daily_variation, marginal='box', labels={'x':'standard deviation (kW)'}, title='Distribution of the daily vairation in generator power')
fig.update_layout()
fig.update_traces()
fig.show()
In [25]:
# Re-calculate step changes in rectifier load
rectifier_load['RECTIFIER_LOAD_DIFF'] = rectifier_load['VALUE'].diff().abs()

# Step change detection for generator power
generator_power['GENERATOR_DIFF'] = generator_power['VALUE'].diff().abs()
significant_gen_changes = generator_power[generator_power['GENERATOR_DIFF'] > 1.5] # using 1.5 as a threshold based on above distribution

# Step change detection for rectifier load
significant_rect_changes = rectifier_load[rectifier_load['RECTIFIER_LOAD_DIFF'] > 0.02 ] # using 0.02 from #3 

# Results
display(significant_gen_changes[['TIMESTAMP', 'VALUE', 'GENERATOR_DIFF']]) 
display(significant_rect_changes[['TIMESTAMP', 'VALUE', 'RECTIFIER_LOAD_DIFF']])
TIMESTAMP VALUE GENERATOR_DIFF
9 2020-08-24 01:30:00 3.4697 3.4697
12 2020-08-24 02:00:00 10.4526 6.9829
15 2020-08-24 02:30:00 2.0011 8.4515
18 2020-08-24 03:00:00 0.0000 2.0011
33 2020-08-24 05:30:00 10.2884 10.1695
... ... ... ...
13107 2020-11-23 05:00:00 2.4821 6.0855
13110 2020-11-23 05:30:00 0.0000 2.4821
13194 2020-11-23 19:30:00 1.6934 1.6934
13197 2020-11-23 20:00:00 8.5276 6.8342
13203 2020-11-23 21:00:00 0.0000 8.5088

802 rows × 3 columns

TIMESTAMP VALUE RECTIFIER_LOAD_DIFF
4 2020-08-24 00:30:00 2.4072 0.0205
10 2020-08-24 01:30:00 2.3504 0.0561
13 2020-08-24 02:00:00 2.3007 0.0497
16 2020-08-24 02:30:00 2.3592 0.0585
25 2020-08-24 04:00:00 2.3949 0.0333
... ... ... ...
13186 2020-11-23 18:00:00 1.7483 0.0389
13189 2020-11-23 18:30:00 1.7196 0.0287
13192 2020-11-23 19:00:00 1.5908 0.1288
13195 2020-11-23 19:30:00 1.7133 0.1225
13198 2020-11-23 20:00:00 1.6861 0.0272

2223 rows × 3 columns

Note, the threshold for each case is a choice based on the current data. We can change this threshold based on other requirements.

7. Solar power is an increasingly common power source utilised by tower companies to reduce costs (fuel) and carbon emissions. One significant cause of panel output being lower than anticipated is shading from surrounding trees, buildings or the tower itself. The image below shows how shading can impact a solar power profile – a solar power profile is generally a parabola shape, a shading profile can take many different forms, but the pattern must repeat over multiple days. How might you automatically detect and quantify this problem?¶

image.png

We can implement the following steps to quantify and detect shading

  1. Collect data over a week: We can analyze the daily power profiles to visualize how the daily solar power output is over the hours of the day. We can even use simulated data by generating it from a parabolic equation that reflects day time hours.
  2. Identify deviation from the parabolic shape (e.g. the plot above) i.e. the difference is power from step 1 over a week to make sure the pattern is repeated.
  3. Define a threshold of decrease in power.
  4. Deviation beyond this threshold will identify shading effect automatically.